Exploratory Data Analysis

The main focus of this EDA is to find datasets that can be used to train a machine learning model. It is also important to build a better understanding of the environment and the datasets we are dealing with. In the upcoming sections various datasets will be explored and their valuable features extracted. These features will then be combined to create the new dataset.

Contents

1. Datasourcing

2. Datatypes

3. Sources

1. Datasourcing

Dataset 1: Traffic accidents in the Netherlands 2004

Let's import the first dataset, which covers all recorded traffic accidents in the Netherlands during 2004.

As you can see, the dataset contains many features. Let's take a closer look at it.

This is a relatively large dataset. Let's display all the features it contains.
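A minimal sketch of the loading and inspection step. The file path and separator of the real 2004 file are not given here, so a small in-memory CSV stands in for it. Only the column names Aard, Schade and Afloop3 come from this notebook; the cell values (including the J/N damage codes) are placeholders.

```python
import io

import pandas as pd

# Stand-in for the real 2004 file; its path and separator are not given here.
# Column names come from this notebook, the values are placeholders.
raw = io.StringIO(
    "Aard,Schade,Afloop3\n"
    "Flank,J,placeholder_a\n"
    "Frontaal,,placeholder_b\n"
    "Eenzijdig,N,placeholder_a\n"
)
accidents = pd.read_csv(raw)

# Size of the dataset and the full list of features it contains.
n_rows, n_cols = accidents.shape
print(f"{n_rows} rows, {n_cols} columns")
print(accidents.columns.tolist())
```

With the real file, `raw` would be replaced by the path to the download.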

From here we can get more insight into which feature sets are usable. The features can be split into the following groups.

Location

Situation

Vehicle

Infrastructure

Limitations

Environmental

Law enforcement

Establishing usable features

Nature of incidents

Because this research is about traffic incidents, we have to know the nature of each incident. Let's take a deeper look into Aard (nature).

Here we can see the 10 main types, the last of which is unknown. Before looking at the distribution, let's check for null values.

Luckily, every datapoint has an established type. Let's look at the distribution.
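The null check and distribution for Aard can be sketched as follows. The small frame is a toy stand-in for the real data, so the counts are illustrative only.

```python
import pandas as pd

# Toy stand-in; the real frame has far more rows and columns.
accidents = pd.DataFrame(
    {"Aard": ["Flank", "Flank", "Kop/Staart", "Frontaal", "Flank", "Eenzijdig"]}
)

# Number of records without an accident type.
missing = int(accidents["Aard"].isna().sum())

# Absolute and relative frequency of each type, most common first.
counts = accidents["Aard"].value_counts()
shares = accidents["Aard"].value_counts(normalize=True)

print(missing)
print(counts)
```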

Here we can see that Flank is the most common type of traffic incident. This translates to a collision from the side.

Another thing to note is the clear distinction between Kop/Staart, Frontaal and Eenzijdig. It would be interesting to know on what basis this distinction is made.

Kop/Staart: Rear-end collision (literally "head/tail")

Frontaal: Head-on collision

Eenzijdig: Single-vehicle accident (literally "one-sided")

Damages

An interesting combination would be to check the damages related to the type of accident.

First, let's check for null values in Schade.

There are some, about 30% of the total. Let's check the types.

In this case there is a clear distinction between 3 values: Yes, No and NaN.

First let's change all NaN to O, and then incorporate this data into the previous chart.
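A minimal sketch of this step, assuming the Yes/No values are stored as the codes J and N (the actual codes are not shown here). Missing values become O, and a cross-tabulation against Aard gives the numbers behind a combined chart.

```python
import numpy as np
import pandas as pd

# Toy stand-in; "J"/"N" as Yes/No codes are an assumption.
accidents = pd.DataFrame(
    {
        "Aard": ["Flank", "Flank", "Vast voorwerp", "Vast voorwerp", "Frontaal"],
        "Schade": ["J", "N", np.nan, np.nan, "J"],
    }
)

# Share of records without a damage entry.
null_share = accidents["Schade"].isna().mean()

# Work on a copy so the original column keeps its NaN values.
damage = accidents.assign(Schade=accidents["Schade"].fillna("O"))

# Rows: accident type, columns: damage category (J, N, O).
table = pd.crosstab(damage["Aard"], damage["Schade"])
print(table)
```

Calling `table.plot(kind="bar", stacked=True)` would be one way to draw the combined chart, provided matplotlib is installed.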

This is a very important insight for the dataset we are dealing with. At first glance it seems obvious that almost every accident involves some kind of damage. But after converting the NaN datapoints we can tell that this is not always the case!

Vast voorwerp (fixed object) is the perfect example of this. In many of these cases, law enforcement officers cannot record the damage to the vehicle because the driver has left the scene.

Gender

Let's also take a look at gender and whether it has any influence.

Let's make a copy of the dataset, change the NaN to O and compare the two results in a graph.

From this we can conclude that this feature is not feasible to use in these circumstances. Too many datapoints are unknown to draw an accurate conclusion.
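One way to make the "too many unknowns" judgement explicit is to compute the share of unknown entries for a feature. This is a sketch: `unknown_share` is a hypothetical helper, and the column name `Geslacht` and the M/V codes are assumptions rather than values confirmed by the dataset.

```python
import numpy as np
import pandas as pd


def unknown_share(series: pd.Series, unknown_label: str = "O") -> float:
    """Fraction of entries that are missing or carry the unknown label."""
    filled = series.fillna(unknown_label)
    return float((filled == unknown_label).mean())


# Toy stand-in; column name and gender codes are assumptions.
gender = pd.Series(["M", np.nan, "V", np.nan, np.nan, "O"], name="Geslacht")

print(f"{unknown_share(gender):.0%} of the gender entries are unknown")
```

The same helper can be applied to any other categorical column to compare how complete the candidate features are.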

Resulting damages

Finally, let's take a closer look at the outcomes of these accidents. They can be found in Afloop3.

As you can see, every accident has a recorded outcome. Let's take a look at the distribution.

Luckily, only a small share of the cases ends in a deadly situation. It would be interesting to see what made the difference in those cases.

Location

It is possible to plot all datapoints on a map, but it is not practical: we already know these incidents happen all around the Netherlands, and displaying all 160k datapoints would take quite some computing power.

Therefore we will keep Longitude and Latitude for later comparison and focus on data that can be classified for now. Let's look at the provinces and their relative number of traffic incidents.

There are no null values, so we can continue by displaying the categorical data.

It is interesting to see that None is visible in the pie chart, while it is not present in the table. Let's investigate this further.

So it seems there are no None values, and no further cleaning is needed for this. Let's continue and take a look at the types of accidents per province.
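A sketch of how the None check could be done. It looks for both real missing values and the literal text "None", since either could end up as a label in a chart. The column name `Provincie` is an assumption.

```python
import pandas as pd

# Toy stand-in; the column name is an assumption.
provinces = pd.Series(["Utrecht", "Gelderland", "Utrecht", "Limburg"], name="Provincie")

# Real missing values (NaN or None objects).
n_missing = int(provinces.isna().sum())

# Entries that contain the text "None" instead of a province name.
n_none_text = int(provinces.astype(str).str.strip().str.lower().eq("none").sum())

# Keep missing values in the table so it cannot silently differ from a chart.
counts = provinces.value_counts(dropna=False)

print(n_missing, n_none_text)
print(counts)
```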

Here we can clearly see that there are 3 main types:

  1. Flank
  2. Kop/Staart
  3. Vast voorwerp

This matches the kinds of accidents seen daily.
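The ranking of accident types per province can be sketched with a grouped count. The data below is a toy stand-in and `Provincie` is an assumed column name, so the resulting order is illustrative only.

```python
import pandas as pd

# Toy stand-in; "Provincie" is an assumed column name.
accidents = pd.DataFrame(
    {
        "Provincie": ["Utrecht"] * 4 + ["Limburg"] * 3,
        "Aard": [
            "Flank", "Flank", "Kop/Staart", "Frontaal",
            "Vast voorwerp", "Vast voorwerp", "Flank",
        ],
    }
)

# Number of accidents for every (province, type) combination.
per_province = (
    accidents.groupby(["Provincie", "Aard"]).size().reset_index(name="count")
)

# Keep the three most frequent types within each province.
top_types = (
    per_province.sort_values(["Provincie", "count"], ascending=[True, False])
    .groupby("Provincie")
    .head(3)
)
print(top_types)
```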

Vehicle

First, let's take a look at the types of vehicles there are.

With this we can create a new dataset about the vehicles.

Here we can already see our first problem if we want to incorporate the date of the vehicle. Let's see how many of these entries are empty.

This is about 30 percent of the dataset. We will keep these entries as they are, because the gap may be due to the vehicle having left the scene, leaving no information about it. To see a distribution, we will have to convert the date values.

Here we can see the format is yyyymmdd.
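A minimal sketch of the date conversion, assuming the dates were read as numbers (a column with NaN values is typically loaded as floats). Missing entries are kept and simply become NaT.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the vehicle date column in yyyymmdd form.
raw_dates = pd.Series([19980512.0, np.nan, 20031130.0, 20010101.0])

# Turn the known values into "yyyymmdd" strings and parse them.
known = raw_dates.dropna().astype("int64").astype(str)
parsed = pd.to_datetime(known, format="%Y%m%d", errors="coerce")

# Restore the original index so missing dates show up as NaT.
parsed = parsed.reindex(raw_dates.index)

# Number of vehicles per year, ignoring the missing dates.
per_year = parsed.dt.year.value_counts().sort_index()
print(per_year)
```

With `errors="coerce"`, values that do not match the format also become NaT instead of raising an error.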

Here we can see that the number of vehicles without a proper registration increased significantly, which is surprising. This could be due to an influx of foreign vehicles, but that is not certain for now.

Usable Features

As we can see, there is some usable information in this dataset, but let's take a look at a different one to compare the usable features.

Dataset 2: Accidents in 2015

This dataset consists of 2 main sets, plus references. We will focus on the accidents first to see if the dataset is of any use. Let's import it.

This data has been tagged. The downloaded file structure contains the referenced features. We can easily pick some out and see what data we are dealing with. First, let's get a better look at the shape.

Good to see we have another potentially useful dataset. These features will be my main points of interest:

From this we can see that not many of the features are actually usable, because they are null. The only usable features are:

  1. AOL_ID
  2. AP3_CODE
  3. WSE_ID
  4. WVG_ID

Now let's separate these columns from the main dataframe.
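The screening and selection can be sketched as follows. The four column names come from the list above; the values, the `EMPTY_EXAMPLE` column and the 50% threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in; values and the all-null column are placeholders.
accidents_2015 = pd.DataFrame(
    {
        "AOL_ID": [1, 2, 3],
        "AP3_CODE": ["a", "b", "a"],
        "WSE_ID": [10, 11, 12],
        "WVG_ID": [5, 5, 6],
        "EMPTY_EXAMPLE": [np.nan, np.nan, np.nan],
    }
)

# Share of null values per column.
null_share = accidents_2015.isna().mean()

# Keep columns that are less than half empty (assumed threshold).
usable_columns = null_share[null_share < 0.5].index.tolist()

# Separate the usable columns from the main dataframe.
selected = accidents_2015[usable_columns].copy()
print(selected.head())
```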

From this we can tell that a lot of the data is onbekend, i.e. unknown. Sadly, we will not be able to use this dataset because of its lack of correlatable features.

Dataset 3: Deadly accidents in the Netherlands 2006-2012

The third dataset contains information about each traffic-related death in the Netherlands between 2006 and 2012. Let's import the dataset and see what usable features we can pull out.

Here we have a very nice and well-populated dataset we can use. Let's display this data in a readable form, starting with the age.
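One way to display the age data is to group it into bands before counting. This is a sketch: the column name `age`, the sample values and the band edges are all assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in; column name, values and band edges are assumptions.
ages = pd.Series([4, 17, 23, 35, 61, 78, 82], name="age")

bins = [0, 18, 30, 50, 65, 120]
labels = ["0-17", "18-29", "30-49", "50-64", "65+"]

# right=False makes each band include its lower edge, e.g. 18 falls in "18-29".
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Count per band, kept in age order rather than sorted by frequency.
group_counts = age_group.value_counts().sort_index()
print(group_counts)
```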

Dataset 4

Vehicle

Persons of interest

Limitations

Objects

Sources